 virtual client


Whispers of Data: Unveiling Label Distributions in Federated Learning Through Virtual Client Simulation

arXiv.org Artificial Intelligence

Federated Learning enables collaborative training of a global model across multiple geographically dispersed clients without the need for data sharing. However, it is susceptible to inference attacks, particularly label inference attacks. Existing label distribution inference attacks are sensitive to the specific settings of the victim client and typically underperform under defensive strategies. In this study, we propose a novel label distribution inference attack that is stable and adaptable to various scenarios. Specifically, we estimate the size of the victim client's dataset and construct several virtual clients tailored to the victim client. We then quantify the temporal generalization of each class label for the virtual clients and use the variation in temporal generalization to train an inference model that predicts the label distribution proportions of the victim client. We validate our approach on multiple datasets, including MNIST, Fashion-MNIST, FER2013, and AG-News. The results demonstrate the superiority of our method compared to state-of-the-art techniques. Furthermore, our attack remains effective even under differential privacy defense mechanisms, underscoring its potential for real-world applications.


FedCert: Federated Accuracy Certification

arXiv.org Artificial Intelligence

Federated Learning (FL) has emerged as a powerful paradigm for training machine learning models in a decentralized manner, preserving data privacy by keeping local data on clients. However, evaluating the robustness of these models against data perturbations on clients remains a significant challenge. Previous studies have assessed the effectiveness of models in centralized training based on certified accuracy, which guarantees that a certain percentage of the model's predictions will remain correct even if the input data is perturbed. However, extending these evaluations to FL remains unresolved because each client's local data is unknown. To tackle this challenge, this study proposes a method named FedCert as a first step toward evaluating the robustness of FL systems. The proposed method is designed to approximate the certified accuracy of a global model based on the certified accuracy and class distribution of each client. Additionally, considering the Non-Independent and Identically Distributed (Non-IID) nature of data in real-world scenarios, we introduce a client grouping algorithm to ensure reliable certified accuracy during the aggregation step of the approximation algorithm. Through theoretical analysis, we demonstrate the effectiveness of FedCert in assessing the robustness and reliability of FL systems. Moreover, experimental results on the CIFAR-10 and CIFAR-100 datasets under various scenarios show that FedCert consistently reduces the estimation error compared to baseline methods. This study offers a solution for evaluating the robustness of FL systems and lays the groundwork for future research to enhance the dependability of decentralized learning. The source code is available at https://github.com/thanhhff/FedCert/.
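
To make the idea of combining per-client certified accuracy with class distributions concrete, here is a minimal Python sketch. It is an illustrative assumption, not FedCert's actual algorithm: it simply takes a sample-weighted average of per-class certified accuracies and omits the client grouping step entirely. The function name and input layout are hypothetical.

```python
from typing import Sequence


def approx_global_certified_accuracy(
    client_cert_acc: Sequence[Sequence[float]],
    client_class_counts: Sequence[Sequence[int]],
) -> float:
    """Sample-weighted average of per-class certified accuracy across clients.

    client_cert_acc[i][c]     : certified accuracy of client i on class c (0..1)
    client_class_counts[i][c] : number of samples of class c held by client i

    Hypothetical helper for illustration only; not the FedCert method.
    """
    if len(client_cert_acc) != len(client_class_counts):
        raise ValueError("one accuracy row and one count row per client required")

    certified_samples = 0.0  # expected number of certifiably correct samples
    total_samples = 0
    for acc_per_class, counts in zip(client_cert_acc, client_class_counts):
        if len(acc_per_class) != len(counts):
            raise ValueError("each client needs one accuracy per class count")
        for acc, n in zip(acc_per_class, counts):
            certified_samples += acc * n
            total_samples += n

    if total_samples == 0:
        raise ValueError("no samples reported by any client")
    return certified_samples / total_samples
```

In this sketch a client contributes to the estimate only in proportion to the samples it holds in each class, so a client with no data for a class has no influence on that class's term.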


Learning to Generate Image Embeddings with User-level Differential Privacy

arXiv.org Artificial Intelligence

Representation learning, by training deep neural networks as feature extractors to generate compact embedding vectors from images, is a fundamental component in computer vision. Metric learning, a kind of representation learning using supervised data, has been widely applied to image recognition, clustering, and retrieval [Schroff et al., 2015; Weinberger and Saul, 2009; Weyand et al., 2020]. Machine learning models have the capacity to memorize training data [Carlini et al., 2019, 2021], leading to privacy risks when the models are deployed. Privacy risk can also be audited by membership inference attacks [Carlini et al., 2022; Shokri et al., 2017], i.e., detecting whether certain data was used to train a model and potentially exposing users' usage behaviors. Defending against such risks is a critical responsibility when training on privacy-sensitive data. Differential Privacy (DP) [Dwork et al., 2006] is an extensively used quantifiable measurement of privacy risk, now generally accepted as a standard notion of privacy in both industry and government [Apple Privacy Team, 2017; Ding et al., 2017; McMahan and Thakurta, 2022; US Census Bureau, 2021]. Applied to machine learning, DP requires a training procedure with explicit randomness, and guarantees that the distribution over output models is quantifiably similar given a certain scope of change to the training dataset. A DP guarantee with respect to the change of a single arbitrary training example is known as example-level DP, which provides plausible deniability (in the binary hypothesis testing sense of [Kairouz et al., 2015]) that any single example (e.g., image) occurred

The first two authors contributed equally.
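
For reference, the "quantifiably similar" guarantee described above is usually stated as the standard $(\varepsilon, \delta)$-DP inequality. The symbols $\mathcal{M}$ (the randomized training procedure), $D$ and $D'$ (neighboring datasets), and $S$ (a set of output models) are notation introduced here, not taken from the excerpt.

```latex
% Standard (epsilon, delta)-differential privacy:
% for all neighboring datasets D, D' and all measurable output sets S,
\Pr\bigl[\mathcal{M}(D) \in S\bigr]
  \;\le\; e^{\varepsilon}\,\Pr\bigl[\mathcal{M}(D') \in S\bigr] + \delta
```

Under example-level DP, $D$ and $D'$ differ in a single training example. Under user-level DP, as in the paper's title, they differ in all of the examples contributed by one user.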